Note to the Reader¶

  1. To keep this explainer notebook as clean and readable as possible, several longer Python functions have been moved to separate files located in the /utilities folder.

  2. Our full project, including code and website source code, is publicly available on GitHub: https://github.com/Andreas-Holm-2/02467-Project-assignment

  3. Our website is online at https://andreas-holm-2.github.io/02467-Project-assignment/

  4. The datasets used in this project can be accessed here:

Spotify Artist Collaboration Network (Large Dataset):

  • https://www.kaggle.com/datasets/jfreyberg/spotify-artist-feature-collaboration-network?resource=download&fbclid=IwZXh0bgNhZW0CMTEAAR5jxk4FClm9GBFvCklV4Ga5SRwON3ajfqLdOMjFqN1uereEEduffCtjEbkjJg_aem_YUf8t7fIjwbCWiHS_vrVc

US Top 10K Artists and Their Popular Songs (Smaller Dataset):

  • https://www.kaggle.com/datasets/spoorthiuk/us-top-10k-artists-and-their-popular-songs

Motivation

As music plays a significant role in most people’s lives—including our own—it was naturally intriguing for us to explore this domain. The goal of this project is to investigate whether there are any significant differences in the way artists within different music genres collaborate. More specifically, we were interested in examining whether collaboration patterns differ between the two highly collaborative genres: pop and rap. We hypothesize that these patterns do indeed vary, and we aim to explore what this reveals about the culture within each genre. We believe that network science is the perfect tool to help us answer this question. Moreover, we want to compare linguistic styles and themes within each genre.

In this project, we constructed our own dataset by utilizing two different sources. The first was a large dataset containing approximately 156,000 artists and recorded collaborations between them. We found this dataset ideal for our project, as it included the key features we needed for constructing a network: artist names, follower counts (allowing us to filter out smaller or duplicate artists), and genre labels (enabling us to compare genres within the network).

In addition, we used a smaller dataset containing a list of the 10,000 most listened-to artists within the US. This dataset allowed us to narrow our analytical focus, excluding less influential or smaller artists. We intersected this list with the larger dataset to retain only relevant artists. Various additional preprocessing steps were employed to ensure a clean and usable dataset, which are described further below.

Our datasets did not naturally contain any text-data. To address this, we used our knowledge from class to leverage the Genius API in order to collect lyrics for each artist. Our approach was to gather lyrics from each artist’s three most popular songs, as these provide a good representation of their musical style and identity.

In [1]:
# Import of the most central dependicies
import pandas as pd
import matplotlib.pyplot as plt
import ast
import networkx as nx
import community as community_louvain
from collections import Counter, defaultdict
import netwulf as nw
import numpy as np
import re
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0 
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from scipy.stats import chi2
from nltk.tokenize import MWETokenizer
from wordcloud import WordCloud
import math
import pickle 

import sys
sys.dont_write_bytecode = True

Basic stats. Let's understand the dataset better

Data cleaning and preprocessing¶

In [2]:
# Spotify Artist Collaboration Network (Large Dataset)
nodes_df = pd.read_csv('nodes.csv')
edges_df = pd.read_csv('edges.csv')
nodes_df['genres'] = nodes_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
nodes_df
Out[2]:
spotify_id name followers popularity genres chart_hits
0 48WvrUGoijadXXCsGocwM4 Byklubben 1738.0 24 [nordic house, russelater] ['no (3)']
1 4lDiJcOJ2GLCK6p9q5BgfK Kontra K 1999676.0 72 [christlicher rap, german hip hop] ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ...
2 652XIvIBNGg3C0KIGEJWit Maxim 34596.0 36 [] ['de (1)']
3 3dXC1YPbnQPsfHPVkm1ipj Christopher Martin 249233.0 52 [dancehall, lovers rock, modern reggae, reggae... ['at (1)', 'de (1)']
4 74terC9ol9zMo8rfzhSOiG Jakob Hellman 21193.0 39 [classic swedish pop, norrbotten indie, swedis... ['se (6)']
... ... ... ... ... ... ...
156417 2ces6d2YsQP1RpGMYpdFy8 David Urwitz 5470.0 29 [classic swedish pop] NaN
156418 6AeznZajNbXUulT7W4tK5l Darmiko 2022.0 23 [] NaN
156419 3GEijIjrgb4lPe9WtURBzz Katriell 268.0 0 [] NaN
156420 0ldQL0icSoMz9OOZcWG8Zt Yung Fresh 181.0 19 [] NaN
156421 1QZqarAGs0Lxx495oNcBnZ Rakshitha Rao 23.0 24 [] NaN

156422 rows × 6 columns

Explaining the attributes

The notebook uses of the following attributes: spotify_id, name, followers, popularity, genres and chart_hits. Some of these are self-explanatory, however the following needed further clarification:

  • Popularity, a value between 0-100, is a heuristic calculated by spotify that is primarly based on total number of streaming counts, artist chart positions and how recent those are recieved.
  • Genres is the artists collection of areas he/she creates music in. Relevant for this project is discovering many different POP and RAP genres, such as drill-rap and k-pop
  • Chart_hitsis a metric showing how high songs have been rated on different countries hit-lists

Reference: https://developer.spotify.com/documentation/web-api/reference/get-an-artist

In [3]:
# US Top 10K Artists:
artists_us_df = pd.read_csv("most_listened_artists_in_US_dataset.csv", index_col=0)
artists_us_df
Out[3]:
ID Gender Age Country Genres Popularity Followers URI
Name
Drake 3TVXtAsR1Inumwj472S9r4 male 33 CA ['canadian hip hop', 'canadian pop', 'hip hop'... 95 83298497 spotify:artist:3TVXtAsR1Inumwj472S9r4
Post Malone 246dkjvS1zLTtiykXe5h60 male 25 US ['dfw rap', 'melodic rap', 'pop', 'rap'] 86 43130108 spotify:artist:246dkjvS1zLTtiykXe5h60
Ed Sheeran 6eUKZXaKkcviH0Ku9w2n3V male 29 GB ['pop', 'singer-songwriter pop', 'uk pop'] 87 115998928 spotify:artist:6eUKZXaKkcviH0Ku9w2n3V
J Balvin 1vyhD5VmyZ7KMfW5gqLgo5 male 35 CO ['reggaeton', 'reggaeton colombiano', 'trap la... 83 38028010 spotify:artist:1vyhD5VmyZ7KMfW5gqLgo5
Bad Bunny 4q3ewBCX7sLwd24euuV69X male 26 PR ['reggaeton', 'trap latino', 'urbano latino'] 95 77931484 spotify:artist:4q3ewBCX7sLwd24euuV69X
... ... ... ... ... ... ... ... ...
IVAN$ITO 0cPmSFfjpop7imPVWSk2yc NaN 0 NaN [] 20 4249 spotify:artist:0cPmSFfjpop7imPVWSk2yc
Antonello Venditti 3hYLJPJuDyblFKersEaFd6 male 71 IT ["canzone d'autore", 'classic italian pop', 'i... 54 778642 spotify:artist:3hYLJPJuDyblFKersEaFd6
Lea Salonga 1GlMjIezcLwV3OFlX0uXOv female 49 PH ['movie tunes', 'tagalog worship'] 56 179832 spotify:artist:1GlMjIezcLwV3OFlX0uXOv
Vertical Horizon 6Hizgjo92FnMp8wGaRUNTn mixed 29 NaN ['neo mellow', 'pop rock', 'post-grunge'] 48 431277 spotify:artist:6Hizgjo92FnMp8wGaRUNTn
Lakko 48wDYb8J9emrtnfRJvYEdZ male 0 ES [] 21 21571 spotify:artist:48wDYb8J9emrtnfRJvYEdZ

9488 rows × 8 columns

In [4]:
artists_us_name_list = (artists_us_df.index).tolist() # Convert the list of names into list allowing us to find the intersection next
print(artists_us_name_list[:5]) # Security check
['Drake', 'Post Malone', 'Ed Sheeran', 'J Balvin', 'Bad Bunny']

We will now be constructing our dataset as a intersection between the large spotify collaration dataset and artists_us_name_list

In [5]:
nodes_df = nodes_df[nodes_df["name"].isin(artists_us_name_list)]
nodes_df
Out[5]:
spotify_id name followers popularity genres chart_hits
0 48WvrUGoijadXXCsGocwM4 Byklubben 1738.0 24 [nordic house, russelater] ['no (3)']
1 4lDiJcOJ2GLCK6p9q5BgfK Kontra K 1999676.0 72 [christlicher rap, german hip hop] ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ...
15 3xs0LEzcPXtgNfMNcHzLIP Rockwell 40344.0 58 [] ['us (1)', 'gb (1)', 'at (1)', 'be (1)', 'ca (...
20 2NUz5P42WqkxilbI8ocN76 Vybz Kartel 1026598.0 63 [dancehall, jamaican dancehall, reggae fusion] ['cr (3)', 'pa (1)']
22 4Lm0pUvmisUHMdoky5ch2I Apocalyptica 864846.0 60 [alternative metal, bow pop, cello, finnish me... ['fi (2)']
... ... ... ... ... ... ...
156038 7p5J8SfKU9Rulp7tcA53G8 Jose Merce 182186.0 51 [cante flamenco, flamenco, nuevo flamenco, rumba] NaN
156119 3gJ0f9ov2Vjrbo9RnFFH76 Endor 267.0 2 [scottish indie folk] NaN
156199 7mKmqnXqn1WoEFljKyvAHR 2T FLOW 20.0 29 [] NaN
156220 6kT18gnkVrCz8xJQcrib7L Bhaskar 230894.0 60 [brazilian bass, brazilian edm, brazilian house] NaN
156330 0Wkm45quqfx3NepJpXDvwE Superorganism 225468.0 51 [art pop] NaN

9754 rows × 6 columns

By inspecting the dataframe, we notice there are more artists in the intersection (9,754) than in the dataset containing the most listened artists in the US (9,488). This indicates there are duplicates in the dataset. We will inspect this:

In [6]:
# Sorting for duplicated within the intersection
duplicate_names = nodes_df[nodes_df.duplicated(subset="name", keep=False)].sort_values(by="name")
duplicate_names
Out[6]:
spotify_id name followers popularity genres chart_hits
43592 1ItNxpDdetHb2gyS10HKfF 18 Karat 1644.0 18 [] NaN
13282 5oWFxbBrbk2Mw86PLUg3OZ 18 Karat 292544.0 53 [deep german hip hop, german hip hop, german u... ['at (16)', 'de (30)', 'lu (3)', 'ch (12)']
57139 09ZUMxxU5pgzUF0FtHeGXG 19HUNNID 5.0 1 [] NaN
59493 2Zm4abMQXwcrsM9IWY3AoB 19HUNNID 513.0 31 [thai trap] NaN
156199 7mKmqnXqn1WoEFljKyvAHR 2T FLOW 20.0 29 [] NaN
... ... ... ... ... ... ...
146823 1wEQjEpK6KfE9Et2ZZBfPa Żabson 43.0 4 [] NaN
24780 0FtUnl4AjR5eDa1v77WU0b ปู่จ๋าน ลองไมค์ 138.0 17 [] NaN
95174 23YWwPEvaO5HLqEMgiUXJa ปู่จ๋าน ลองไมค์ 23424.0 32 [] NaN
6081 3dTgjg7lzUGiD3NwcGCK1n 阿冗 44494.0 48 [chinese viral pop, mainland chinese pop] ['my (1)', 'sg (1)', 'tw (4)']
8241 7sD5pBZNNSDMfiF2BvRem7 阿冗 671.0 31 [] ['tw (1)']

1739 rows × 6 columns

As evident from our dataset, there are a significant number of duplicate artist names. This is primarily due to the scale of the Spotify dataset, which contains over 156,000 artists. Another contributing factor is that Spotify does not enforce unique artist names—meaning multiple users, including lesser-known or negligible artists, can share the same name.

For example, take the artist name Drake, which belongs to one of the most popular artists on the platform. However, because names on Spotify are not unique, a relatively unknown user could also appear under the name Drake, causing confusion and leading to duplicates in our dataset.

To resolve this, we implement a simple disambiguation method: for each duplicated name, we retain only the artist with the highest number of Spotify followers, assuming this is the most prominent and relevant entry.

In [7]:
clean_df = nodes_df.sort_values("followers", ascending=False)
clean_df = nodes_df.drop_duplicates(subset="name", keep="first")
clean_df
Out[7]:
spotify_id name followers popularity genres chart_hits
0 48WvrUGoijadXXCsGocwM4 Byklubben 1738.0 24 [nordic house, russelater] ['no (3)']
1 4lDiJcOJ2GLCK6p9q5BgfK Kontra K 1999676.0 72 [christlicher rap, german hip hop] ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ...
15 3xs0LEzcPXtgNfMNcHzLIP Rockwell 40344.0 58 [] ['us (1)', 'gb (1)', 'at (1)', 'be (1)', 'ca (...
20 2NUz5P42WqkxilbI8ocN76 Vybz Kartel 1026598.0 63 [dancehall, jamaican dancehall, reggae fusion] ['cr (3)', 'pa (1)']
22 4Lm0pUvmisUHMdoky5ch2I Apocalyptica 864846.0 60 [alternative metal, bow pop, cello, finnish me... ['fi (2)']
... ... ... ... ... ... ...
155870 3a9qv6NLHnsVxJUtKOMHvD The Glitch Mob 538974.0 61 [edm, electro house, glitch, glitch hop, indie... NaN
156036 7dh6G6qILmRpUtZU4ZSD4D Trobeats 515.0 9 [] NaN
156038 7p5J8SfKU9Rulp7tcA53G8 Jose Merce 182186.0 51 [cante flamenco, flamenco, nuevo flamenco, rumba] NaN
156220 6kT18gnkVrCz8xJQcrib7L Bhaskar 230894.0 60 [brazilian bass, brazilian edm, brazilian house] NaN
156330 0Wkm45quqfx3NepJpXDvwE Superorganism 225468.0 51 [art pop] NaN

8756 rows × 6 columns

Now, as we can see the intersection contains only 8756 rows (before 9754), meaning we succesfully removed 998 duplicate artists with the least amount of followers. We have now reached our dataset clean_df only consisting of the most listened artists excluding duplicates. This serves as a base graph that will be partitioned into a pop_df and rap_df.

Each artist is mapped to the genre that he/she participates mostly in. The function counts the occurences of "pop" and "rap" in the artists genres property. Then the genre that occurs mostly the artist is mapped to. This will prevent an artist to be present in both dataframes.

In [8]:
from utilities.Network_construction_functions import split_artists_by_primary_genre

pop_df, rap_df = split_artists_by_primary_genre(clean_df, ["pop", "rap"])

print(f'There are {len(pop_df)} artists in the constructed pop network')
print(f'There are {len(rap_df)} artists in the constructed rap network')
There are 4161 artists in the constructed pop network
There are 1149 artists in the constructed rap network

Dataset statistics

In the following section we calculate the Key statistics for the two networks:

  • Number of nodes

  • Number of edges

  • Network density

  • Number of isolated nodes

  • Is connected

  • Number of connected components

  • Size of largest component

  • Average shortest path length

  • Average clustering coefficient

  • Transitivity

  • Top collaborators (determined by largest degree)

In [9]:
from utilities.Network_construction_functions import get_Graph_with_names

G_pop = get_Graph_with_names(pop_df, edges_df, verbose=False)
G_rap = get_Graph_with_names(rap_df, edges_df, verbose=False)
In [10]:
from utilities.Network_statistics import print_network_statistics
In [11]:
print('Pop network statistics')
print_network_statistics(G_pop)
Pop network statistics
Number of nodes: 4161
Number of edges: 13698
Density: 0.0015826909211912818
Number of isolated nodes: 758
Is connected: False
Number of connected components: 801
Size of largest component: 3286
Average shortest path length (largest component): 5.257677282248106
Average clustering coefficient: 0.13024735101867216
Transitivity (global clustering coefficient): 0.1403883617168353

Degree analysis
Average Degree: 6.58
Median Degree: 3.0
Mode Degree: 0
Minimum Degree: 0
Maximum Degree: 162
In [12]:
print('Rap network statistics')
print_network_statistics(G_rap)
Rap network statistics
Number of nodes: 1149
Number of edges: 6860
Density: 0.010401409497123692
Number of isolated nodes: 143
Is connected: False
Number of connected components: 153
Size of largest component: 961
Average shortest path length (largest component): 4.373501994450225
Average clustering coefficient: 0.2818015338203428
Transitivity (global clustering coefficient): 0.32168006480962175

Degree analysis
Average Degree: 11.94
Median Degree: 5.0
Mode Degree: 0
Minimum Degree: 0
Maximum Degree: 104

Examining degree distributions

In [13]:
from utilities.Network_statistics import plot_degree_distribution
from utilities.Network_statistics import plot_degree_distribution_log_log_scale
In [14]:
plot_degree_distribution(G_pop, "POP")
No description has been provided for this image
In [15]:
plot_degree_distribution(G_rap, "RAP")
No description has been provided for this image

The music genre-networks both seem to exhibit a Heavy-tailed distribution of degrees: there are few highly connected nodes. We will plot both in log-log scale as well.

In [16]:
plot_degree_distribution_log_log_scale(G_pop, "POP")
Power-law exponent: 1.913
No description has been provided for this image
In [17]:
plot_degree_distribution_log_log_scale(G_rap, "RAP")
Power-law exponent: 1.380
No description has been provided for this image
In [18]:
from utilities.Network_construction_functions import print_top_collaborators
print("POP-network")
print_top_collaborators(G_pop, 5)
POP-network
Top 5 artists with the most collaborations:
1. R3HAB — 162 collaborations
2. David Guetta — 125 collaborations
3. Tiësto — 113 collaborations
4. Steve Aoki — 102 collaborations
5. Diplo — 88 collaborations
In [19]:
print("POP-network")
print_top_collaborators(G_rap, 5)
POP-network
Top 5 artists with the most collaborations:
1. Gucci Mane — 104 collaborations
2. French Montana — 99 collaborations
3. Future — 98 collaborations
4. Young Thug — 95 collaborations
5. Lil Wayne — 93 collaborations

Discussing the dataset statistics and plots

Our pop Network seem to follow the power-law correlation relative well, with a power-law exponent of 1.913, slightly below the typical 2-3 range for scale-free networks. This indicates our network has a heavy-tailed degree distribution, consistent with many real-world networks where most nodes have few connections while a small number of nodes have many connections.

In comparison, the rap network has a lower power-law exponent of 1.38, indicating an even more skewed distribution. This implies that collaboration in rap is even more heavily concentrated among a few central artists. This suggests a small number of artists drive a large share of the collaborations.

The degree distributions of the pop and rap collaboration networks reveal noteworthy structural differences in how artists collaborate within each genre.

In the pop network, the average degree is 6.58, but the median is only 3 and the mode is 0, suggesting that while a few artists collaborate extensively, the majority are only loosely connected or not connected at all. This is confirmed by the high number of isolated nodes (758 out of 4161) and the fragmented structure with 801 components. The maximum degree of 162 indicates a small number of central pop artists with great influence, acting as hubs in an otherwise sparsely connected network.

In contrast, the rap network is noteably more cohesive. Despite being smaller in size, it shows a higher average degree of 11.94, and a median degree of 5, suggesting that rap artists tend to collaborate more frequently and more broadly than pop artists. The mode is still 0, and there are 143 isolated nodes, but this represents a smaller fraction of the total network compared to pop. The largest component in rap includes 961 out of 1149 nodes — over 80% of the network, which demonstrates a high level of connectivity. Additionally, the maximum degree of 104 still reflects the presence of a central artist as we saw in the pop network.

Overall, the pop network exhibits a more fragmented, with a large amount of disconnected or loosely connected nodes, while the rap network shows a tighter, more integrated structure, where collaboration is more widespread and balanced across the network.

Top collaborators and an interpretation

The top collaborators in both networks highlight interesting patterns. In pop, the most connected artists—R3HAB, David Guetta, Tiësto, Steve Aoki, and Diplo—are all prominent DJs and producers. This aligns with the structure of pop music, where producers often feature a wide range of vocal artists, leading to high collaboration counts.

In contrast, the rap network’s top collaborators: Gucci Mane, French Montana, Future, Young Thug, and Lil Wayne - are all vocal performers. This suggests that in rap, collaboration primarily occurs between performing artists themselves, further verifying the genre’s strong emphasis on features and collaborations.

Tools, theory and analysis

In this section we will cover the following

  • The Louvain algorithm for discovering communities
  • Visualize each of the networks with their communities using Netwulf
  • Zooming in on North American communities
  • Network metrics (modularity and assortativty)
  • Text analysis (wordclouds)

Louvain algorithm

We use the Louvain algorithm to detect communities within the RAP and POP collaboration networks. It works by maximizing modularity, a measure of how densely connected nodes are within communities compared to between them.

Louvain iteratively groups nodes that increase modularity, then compresses communities into single nodes and repeats the process. This makes it both fast and effective, especially for large networks like ours.

It's well-suited for identifying collaborative clusters - such as crews, label groups, or stylistic circles - without needing any prior labels or assumptions.

In [36]:
pop_communities = community_louvain.best_partition(G_pop, random_state=20)
community_list = []
nx.set_node_attributes(G_pop, pop_communities, 'community')

community_sizes = Counter(pop_communities.values())
sorted_communities = sorted(community_sizes.items(), key=lambda x: x[1], reverse=True)
for community_id, size in sorted_communities:
    community_list.append(f"Community {community_id}: {size} nodes")
print(community_list[:10])
['Community 1: 953 nodes', 'Community 6: 388 nodes', 'Community 9: 333 nodes', 'Community 2: 172 nodes', 'Community 16: 156 nodes', 'Community 0: 152 nodes', 'Community 18: 144 nodes', 'Community 25: 139 nodes', 'Community 53: 135 nodes', 'Community 13: 134 nodes']
In [37]:
rap_communities = community_louvain.best_partition(G_rap, random_state=20)
community_list = []
nx.set_node_attributes(G_rap, rap_communities, 'community')

community_sizes = Counter(rap_communities.values())
sorted_communities = sorted(community_sizes.items(), key=lambda x: x[1], reverse=True)
for community_id, size in sorted_communities:
    community_list.append(f"Community {community_id}: {size} nodes")
print(community_list[:10])
['Community 1: 349 nodes', 'Community 2: 191 nodes', 'Community 0: 88 nodes', 'Community 8: 78 nodes', 'Community 4: 65 nodes', 'Community 30: 51 nodes', 'Community 3: 46 nodes', 'Community 18: 27 nodes', 'Community 60: 26 nodes', 'Community 32: 15 nodes']

We will now investigate where the artists in the different communites are.

In [22]:
from utilities.top_community_countries import print_top_community_country_distribution
print("Top 5 POP Communities by Size and Their Country Composition:")
print_top_community_country_distribution(G_pop, artists_us_df)
Top 5 POP Communities by Size and Their Country Composition:
Community 1 (953 artists): US (28%), nan (22%), GB (15%), SE (5%), AU (4%), NL (3%), CA (3%), DE (3%), FR (2%), NO (2%), BE (1%), JP (1%), KR (1%), NZ (1%), IT (1%), IE (1%), DK (1%), JM (1%), ES (0%), RU (0%), EE (0%), PL (0%), BR (0%), IL (0%), CN (0%), PR (0%), LT (0%), RO (0%), IS (0%), CH (0%), MX (0%), FI (0%), VG (0%), AT (0%), GH (0%), TR (0%), AR (0%), GR (0%), AG (0%), XK (0%), CL (0%), CO (0%), TW (0%), NG (0%), MY (0%), ZA (0%), SN (0%), ID (0%), BA (0%), MA (0%), BF (0%), SI (0%)
Community 6 (388 artists): ES (22%), nan (15%), CO (11%), MX (10%), US (8%), DO (4%), PR (4%), AR (3%), VE (3%), PE (3%), IT (2%), CA (2%), PA (2%), CL (2%), GB (2%), FR (2%), BR (1%), JP (1%), DE (1%), KR (1%), UY (1%), CU (1%), GT (1%), NL (0%), HU (0%), AT (0%), HN (0%), TR (0%), NI (0%), GR (0%), CZ (0%), NO (0%), PL (0%), SE (0%), ZA (0%)
Community 9 (333 artists): SE (35%), nan (25%), DK (18%), NO (13%), US (3%), GB (2%), DE (2%), IT (1%), FI (0%), PL (0%), FR (0%), NZ (0%), IS (0%), AU (0%), BE (0%)
Community 2 (172 artists): TW (32%), HK (20%), nan (15%), US (9%), JP (8%), CN (5%), SG (3%), MY (2%), KR (2%), GB (1%), CA (1%), PH (1%), NZ (1%), AU (1%), IT (1%), FI (1%), SE (1%)
Community 16 (156 artists): nan (44%), NL (41%), US (3%), BE (3%), DE (2%), XE (2%), GB (1%), FI (1%), SE (1%), TN (1%), NO (1%), IT (1%), GH (1%), PR (1%), SR (1%)
In [23]:
print("Top 5 Rap Communities by Size and Their Country Composition:")
print_top_community_country_distribution(G_rap, artists_us_df)
Top 5 Rap Communities by Size and Their Country Composition:
Community 1 (349 artists): US (54%), nan (38%), CA (2%), GB (1%), JP (1%), NL (1%), JM (1%), IT (1%), AT (0%), SE (0%), FR (0%), ZA (0%), AR (0%), SK (0%), KR (0%), CM (0%), PR (0%), AU (0%)
Community 2 (191 artists): PR (35%), nan (28%), AR (7%), DO (5%), ES (5%), US (4%), CO (3%), CL (2%), GB (2%), MX (2%), CH (1%), BR (1%), DE (1%), JM (1%), RU (1%), PA (1%), LV (1%), CA (1%), BE (1%), VE (1%), NL (1%)
Community 0 (88 artists): DE (38%), nan (36%), TR (20%), US (2%), FR (1%), NL (1%), JP (1%)
Community 8 (78 artists): nan (46%), PL (36%), DE (5%), NL (4%), CZ (3%), US (3%), JP (1%), SK (1%), HU (1%)
Community 4 (65 artists): nan (42%), IT (40%), US (6%), GB (6%), CA (2%), JP (2%), FR (2%), PR (2%)

In the pop network, the largest community is dominated by mostly U.S. artists, followed by a Spanish-speaking community, then a Scandinavian one, an Asian community, and finally a dutch one.

In the rap network, the largest community is again centered around U.S. artists, followed by a Portuguese-speaking group, a German and Turkish community, while the fourth is mainly polish artists and the fifth communitiy is primarily italian artists.

Visualization of the POP-network

Using the communities calculated using the louvain algorithm we plot the two networks color coded by community

In [24]:
# Visualize pop network

run_netwulf = False # Convience functionality allowing the notebook to be run from top to bottom. If False, Netwulf is not run and an image is simply shown.
                    # Change to True to run Netwulf

if run_netwulf:
    from utilities.Netwulf_plot_functions import netwulf_plot_communities
    from community import community_louvain 

    communities = pop_communities

    colors = ['#e57468', '#68e574', '#7468e5', '#e5d068', '#68d0e5']

    netwulf_plot_communities(G_pop, communities, color_palette=colors, path="Pop_network.pdf",zoom=0.76)

if run_netwulf == False:
    from IPython.display import Image, display

    display(Image('Pop_network.png'))
No description has been provided for this image

Visualization of the RAP-network

In [25]:
# Visualize rap network

run_netwulf = False # Convience functionality allowing the notebook to be run from top to bottom. If False, Netwulf is not run and an image is simply shown.
                    # Change to True to run Netwulf

if run_netwulf:
    from utilities.Netwulf_plot_functions import netwulf_plot_communities
    from community import community_louvain 

    communities = rap_communities

    colors = ['#e57468', '#68e574', '#7468e5', '#e5d068', '#68d0e5']

    netwulf_plot_communities(G_rap, communities, color_palette=colors, path="Rap_network.pdf",zoom=1) # more zoom due a network of about 1/4th of the size

if run_netwulf == False:
    from IPython.display import Image, display

    display(Image('Rap_network.png'))
No description has been provided for this image

Zooming in on North America

Up until now we've considered the network and its communities across countries as an entirety. Therefore, to go more in depth with our network analysis, we zoom in on the North American cluster in both graphs.

This allows us to engage in a more focused network analysis, where country doesn't serve as a confounder for the communities. To do this we locate the corresponding North American clusters in the networks.

In [26]:
from utilities.Network_construction_functions import get_community_subgraph
from utilities.network_analysis import stats_checker, run_parallel, degree_assortativity, modularity, plot_significance 

counts = Counter(pop_communities.values())
most_common_community = counts.most_common(1)[0][0]

G_pop_NA = get_community_subgraph(G_pop, most_common_community, pop_communities)
G_pop_NA_communities = community_louvain.best_partition(G_pop_NA, random_state=10)

counts = Counter(rap_communities.values())
most_common_community = counts.most_common(1)[0][0]

G_rap_NA = get_community_subgraph(G_rap, most_common_community, rap_communities)
G_rap_NA_communities = community_louvain.best_partition(G_rap_NA, random_state=10)

Visualizing the two North American networks

In [31]:
# Visualize POP_NA network

run_netwulf = False # Convience functionality allowing the notebook to be run from top to bottom. If False, Netwulf is not run and an image is simply shown.
                    # Change to True to run Netwulf

if run_netwulf:
    from utilities.Netwulf_plot_functions import netwulf_plot_communities
    from community import community_louvain 

    communities = G_pop_NA_communities

    colors = ['#e57468', '#68e574', '#7468e5', '#e5d068', '#68d0e5']

    netwulf_plot_communities(G_pop_NA, communities, color_palette=colors, path="POP_network_NA.pdf",zoom=1) # more zoom due a network of about 1/4th of the size

if run_netwulf == False:
    from IPython.display import Image, display

    display(Image('Pop_na_network.png'))
No description has been provided for this image
In [33]:
# Visualize POP_NA network

run_netwulf = False # Convience functionality allowing the notebook to be run from top to bottom. If False, Netwulf is not run and an image is simply shown.
                    # Change to True to run Netwulf

if run_netwulf:
    from utilities.Netwulf_plot_functions import netwulf_plot_communities
    from community import community_louvain 

    communities = G_rap_NA_communities

    colors = ['#e57468', '#68e574', '#7468e5', '#e5d068', '#68d0e5']

    netwulf_plot_communities(G_rap_NA, communities, color_palette=colors, path="Rap_network_NA.pdf",zoom=1) # more zoom due a network of about 1/4th of the size

if run_netwulf == False:
    from IPython.display import Image, display

    display(Image('Rap_na_network.png'))
No description has been provided for this image

We've now zoomed in on the North American communities in both networks by creating G_pop_NA and G_rap_NA, allowing us to dive into our analysis

Analyzing the North American networks

To start our network analysis we recompute the number of nodes and average node degree, providing insight to the two subgraphs. This will help us determine how to proceed with the network analysis.

In [ ]:
stats_checker(G_rap_NA, "RAP_NA")
stats_checker(G_pop_NA, "POP_NA")
RAP_NA-graph number of nodes: 349   and average node degree of 18.939828080229226
POP_NA-graph number of nodes: 953   and average node degree of 11.166841552990556

Observing the statistics it is clear that the sub graph G_rap_NA has a much higher average node degree, while simultaneously having three times less nodes compared to G_pop_NA. The difference in size and average node degree plays an important role in the network analysis, since these metrics are used when computing modularity and assortativity respectively. Therefore, we cannot directly compare modularity and assortativity across networks.

Instead we focus our analysis on comparing assortativity and modularity with that of random null models made with thedouble-edge-swap algorithm from week 6. We will examine modularity, degree assortativity and correlation between node-degree and popularity-attribute.

Using the double-edge-swap we create a 1000 null graphs for both networks:

In [ ]:
pop_rnd_networks = []
rap_rnd_networks = []

if False: 
    for i in range(126): 
        pop_rnd_networks += run_parallel(G_pop_NA)
    for i in range(126): 
        rap_rnd_networks += run_parallel(G_rap_NA)
else: 
    with open("1000random_networks_POP.pkl", "rb") as f: 
        pop_rnd_networks = pickle.load(f)
    with open("1000random_networks_RAP.pkl", "rb") as f: 
        rap_rnd_networks = pickle.load(f)

Modularity

We kick off our analysis by examining the Modularity, calculated using the louvain community detection algorithm. Modularity quantify the precense of distinct subgroups / communities, and the strength of these, within a network. We find this appropriate because we seek to investigate how strongly artists form subgroups across the two different genres.

In [ ]:
## POP 
p_mod = modularity(G_pop_NA, G_pop_NA_communities)

p_mods = []
for g in pop_rnd_networks: 
    community = community_louvain.best_partition(g)
    p_mods.append(modularity(g, community))


## RAP 
r_mod = modularity(G_rap_NA, G_rap_NA_communities)

r_mods = []
for g in rap_rnd_networks: 
    community = community_louvain.best_partition(g)
    r_mods.append(modularity(g, community))

print(f"Modularity of POP_NA: {p_mod}, and the average of null models: {np.mean(p_mods)}")
print(f"Modularity of RAP_NA: {r_mod}, and the average of null models: {np.mean(r_mods)}")

plot_significance(p_mods, p_mod, metric_type="Modularity", genre="POP_NA", xrange=(0.2,0.4), color="orange")
plot_significance(r_mods, r_mod, metric_type="Modularity", genre="RAP_NA", xrange=(0.15,0.3), color="red")
Modularity of POP_NA: 0.3556838348801883, and the average of null models: 0.240194517475336
Modularity of RAP_NA: 0.24558531176116505, and the average of null models: 0.17860153082166233
No description has been provided for this image
No description has been provided for this image

Both POP and RAP collaboration networks exhibit clear community structures as indicated by the modularity scores significantly higher than those of their repsective null models. While POP shows slightly higher modularity than RAP, we refrain from concluding anything based on this, since we lack the common ground for comparison.

What can be confidently concluded is that both genres are characterized by meaningful community structures, suggesting that artists tend to cluster into collaboration groups, whether based on label, style or crew

Degree assortatiity

Next we measure degree assortativity, to investigate whether artists with many collaborations - high node degree, tend to collaborate with similarly well-connected artists. Simply put to asses whether popular or central artists prefer to collaborate with eachother.

In [ ]:
p_assort = degree_assortativity(G_pop_NA)
p_assorts = [degree_assortativity(g) for g in pop_rnd_networks]

r_assort = degree_assortativity(G_rap_NA)
r_assorts = [degree_assortativity(g) for g in rap_rnd_networks]

print(f"Degree assortativity of POP_NA: {p_assort}, and the average of null models: {np.mean(p_assorts)}")
print(f"Degree assortativity of RAP_NA: {r_assort}, and the average of null models: {np.mean(r_assorts)}")
plot_significance(p_assorts, p_assort, metric_type="Assortativity", genre="POP_NA", xrange=(-0.2,0.1), color="orange")
plot_significance(r_assorts, r_assort, metric_type="Assortativity", genre="RAP_NA", xrange=(-0.2,0.1), color="red")
Degree assortativity of POP_NA: -0.027179926303306092, and the average of null models: -0.11641794898450765
Degree assortativity of RAP_NA: -0.017366346614197257, and the average of null models: -0.11910496055841951
No description has been provided for this image
No description has been provided for this image

While both genres show significantly less degree disassortativity compared to random networks, the assortativity remains relatively low overall. This likely reflects the nature of music collaborations, where highly active artists frequently collaborate with less connected or emerging artists

Although collaborations between highly connected artists do occur, they are outweighed by the more frequent high-to-low degree collaborations. This imbalance reduces the overall assortativity value, despite the presence of prominent artist-to-artist features.

We were motivated to try computing assortativity by attribute popularity providing a different perspective compared to degree-assortativity, more reliant on centrality of artists. This did not show a significant difference, however, this inspired us to look into how the popularity attribute correlates with node-degree count.

Degree-popularity correlation

Using degree-popularity correlation we want to examine whether the "popularity" attribute correlates well with an artists number of connections, to determine whether their real-world success in each genre is network-driven or externally determined.

In [ ]:
import scipy.stats as stats
G = G_pop_NA
degrees = [G.degree[n] for n in G.nodes]
popularities = [G.nodes[n]['popularity'] for n in G.nodes]
corr, _ = stats.pearsonr(degrees, popularities)
print("POP correlation between popularity and out-degree", corr)

G = G_rap_NA
degrees = [G.degree[n] for n in G.nodes]
popularities = [G.nodes[n]['popularity'] for n in G.nodes]
corr, _ = stats.pearsonr(degrees, popularities)
print("RAP correlation between popularity and out-degree", corr)
POP correlation between popularity and out-degree 0.41641253713377324
RAP correlation between popularity and out-degree 0.6001108855008934

Both POP and RAP show some correlation between popularity and out-degree. However, there is a higher correlation between the two metrics in the RAP network, indicating that popularity is more tied to number of collaborations, suggesting that popularity might be more network driven in the RAP genre. In the POP network the correlation is weaker, which could imply that popularity may be more dependent on external factors, such as marketing, labels or solo success, rather than collaborations.

Text gathering

While the networks provided relevant and interesting analytical points, we seek to further expand our analysis by investigating the semantic differences across the genres.

Since our data doesn't include any artists lyrical data, we decided to gather additional data and further expand our dataset. Researching methods led us to discover the Genius API - geniuslyrics, which allowed us to fetch each artists top 3 songs.

In [ ]:
import lyricsgenius
import concurrent.futures 

token = "SQ5PQucPjwa_LEDaobBQyxewdh80ZB-KhaBau73yASMRNIvnijWprQSi6ZDnUb7N"
genius = lyricsgenius.Genius(token, verbose=False)

During initial testing of the geniuslyrics library we discovered that each request runs slowly leading us to implement our own search function coupled with a parallel running method. This allowed us to speed up the data gathering, but this still showed to be a long and tedious process.

In [ ]:
def search_artists_songs(artist_name): 
    try:
        dic = {"name": artist_name, "lyrics": []}
        result = genius.search_artist(artist_name, max_songs=3, sort="popularity")
        for song in result.songs: 
            dic["lyrics"].append(song.lyrics)
        print("#")
        return dic
    
    except Exception as e:
        print(f"Error fetching artist {artist_name}: {e}")
        return None

def parallel(chosen_list): 
    #Parallel implementation
    for i in range(0, len(chosen_list), 5): 
        print(f"Batch {i//5 + 1} / {len(chosen_list)//5}")

        if i != 0:
            print(f"{(i/5)} / {len(chosen_list)/5 -1}")
        names = chosen_list[i:i+5]

        with concurrent.futures.ThreadPoolExecutor() as executor: 
            tmp = list(executor.map(search_artists_songs, names))
        results += tmp

    df = pd.DataFrame(results)

    with open("Missing1.pkl", "wb") as f: 
        pickle.dump(df, f)
    return df

Selecting all artists names for lyrics gathering

In [ ]:
names = pop_df["name"] + rap_df["name"]
print("Total number of artists for lyrics gathering", len(names))
Total number of artists for lyrics gathering 5310

As this is a process that took over 20 hours we've saved a csv-file all_lyrics.csv containing the lyrics of our artists. 148 artists kept on failing - even after multiple tries, which resulting in a minor decrease in the datasets sizes.

In [ ]:
results = []
#parallel(chosen_list)
all_lyricsdf = pd.read_csv("all_lyrics.csv") #Loading manually since computing this will take many hours otherwise
total_names = pd.concat([pop_df, rap_df])
print("The following artists were not fetched")
total_names[~(total_names["name"].isin(all_lyricsdf["name"].tolist()))]
The following artists were not fetched
Out[ ]:
spotify_id name followers popularity genres chart_hits primary_genre
264 0tzSBCPJZmHTdOA3ZV2mN3 CL 1064466.0 60 [k-pop] ['bg (1)', 'hk (1)', 'id (2)', 'my (4)', 'ph (... pop
277 1n9MS5jDsiVS3x3ArntcnW Lin Hsin Yi 48805.0 35 [mandopop, taiwan pop] ['tw (4)'] pop
391 2h93pZq0e7k5yf4dywlkpM Frank Ocean 9951370.0 83 [hip hop, lgbtq+ hip hop, neo soul, pop] ['us (17)', 'gb (10)', 'au (15)', 'at (1)', 'b... pop
506 19ra5tSw0tWufvUp8GotLo George Michael 3898805.0 71 [dance pop] ['gb (3)', 'ar (1)', 'au (2)', 'be (3)', 'bg (... pop
834 3w8fdvgPp2aKgy125EBgVg 玖壹壹 175195.0 48 [mandopop, taiwan hip hop, taiwan pop] ['tw (12)'] pop
... ... ... ... ... ... ... ...
68017 6ns6XAOsw4B0nDUIovAOUO GZA 442746.0 62 [alternative hip hop, east coast hip hop, gang... NaN rap
87731 2yQf6b8hxahZaT5dHlWaB1 Raekwon 511046.0 68 [alternative hip hop, east coast hip hop, gang... NaN rap
90168 5xkFVS6IwwgHEAaRzXc9U3 Tay Keith 26368.0 57 [memphis hip hop, trap] NaN rap
90452 6FD0unjzGQhX3b6eMccMJe Ghostface Killah 740415.0 67 [alternative hip hop, east coast hip hop, gang... NaN rap
117022 5fSVkkuijRIpxuvSqysYiP Snøw 134698.0 67 [sad lo-fi, sad rap] NaN rap

148 rows × 7 columns

Text analysis

The following section implements and applies all necessary steps for WordClouds using TF-IDF on a graph of artists. It outlines and explains how the process is done with reasoning.

Introduction of TF-IDF¶

TF calculates the fraction of a term in a document.

$$TF(t,d) = \frac{f_{t,d}}{\sum_{t' \in d}(f_{t',d})}$$

where $f_{t,d}$ is the count of term $t$ in document $d$.

IDF is calculated as

$$ \text{IDF}(t, d) = \log \left( \frac{N}{\left| \left\{ d \in D \mid t \in d \right\} \right|} \right) $$

where N is the count of documents in the entire corpus. and $ \left\{ d \in D \mid t \in d \right\} $ is the set of documents containing term t. $\left| \left\{ d \in D \mid t \in d \right\} \right|$ is then the count of documents containing the term t. Summarized IDF returns the log inverse fraction of documents containing the term t in the corpus D.

Combined as the product of the two they give the TF-IDF

$$TFIDF(t,d,D) = tf(t,d)\cdot idf(t,D)$$

This function combines a terms importance within a single document (its internal representation) with its overall significance across the corpus.

Preprocessing of Lyrics Data¶

  • Artists with empty lyrics are kept in the communities however they do not contribute with any lyrics to the representative text of a community, as these represent artists whose lyrics could not be fetched via the Genius API. However, using only lyrics of the remaining artists will prove to be sufficient.

  • The clean_lyrics_and_apply_representative_text function combines all lyrics for each each artist into a single, semantically representative string, that represents the artist's song style.

The lyrics of each song is cleaned in the following ways:

  • Removes metadata such as "Read More\xa0\n" and the first line (often the title).
  • Strips structural tags like [Verse], [Chorus], etc.
  • Removes quoted commentary at the beginning of the lyrics.
  • Filters out non-letter characters, leaving only alphabetic characters and whitespace.
  • Cleans escaped characters (\n, \r)
  • Removes all whitespace.
  • Converts the text to lowercase.

This results in a clean, lowercase string for each artist, ready for semantic analysis.

Tokenization of Representative Lyrics¶

After cleaning the lyrics, each artist’s representative text is transformed into a list of tokens through the following process:

  • Stemming and Stopword Removal: Words are reduced to their stems using a Snowball stemmer. Common English and domain-specific stopwords (e.g., "lyrics", "chorus") are removed.
  • Bigram Detection: Frequent word pairs (bigrams) are extracted using the chi-squared statistic. Only pairs occurring more than 50 times and with p-value < 0.001 are kept.
  • Retokenization: Statistically significant bigrams are merged into single tokens using an MWETokenizer for example "hip_hop".

The result is a structured token representation of each artist that can be used for analysis such as TF-IDF and WordClouds.

Gathering Data For Text Analysis

Before beginning preprocessing we will first retrieve the dataframes necessary for the task at hand.

In [ ]:
print("---------- Cleaned Nodes Dataframe ----------")

display(clean_df.head(2))

print("---------- Lyrics Dataframe ----------")

display(all_lyricsdf.head(2))
---------- Cleaned Nodes Dataframe ----------
spotify_id name followers popularity genres chart_hits
0 48WvrUGoijadXXCsGocwM4 Byklubben 1738.0 24 [nordic house, russelater] ['no (3)']
1 4lDiJcOJ2GLCK6p9q5BgfK Kontra K 1999676.0 72 [christlicher rap, german hip hop] ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ...
---------- Lyrics Dataframe ----------
name lyrics
0 $hirak ['12 ContributorsMiljonair Lyrics“Miljonair” i...
1 $uicideboy$ ["184 ContributorsTranslationsTürkçeItalianoFr...

Merging the Cleaned Nodes Dataframe and Lyrics Dataframe we obtain the dataframe necessary, called artists_with_lyrics.

In [ ]:
artists_with_lyrics = pd.merge(all_lyricsdf, clean_df, on='name', how='inner') 
artists_with_lyrics["lyrics"] = artists_with_lyrics["lyrics"].apply(ast.literal_eval)

print(f"{len(artists_with_lyrics)} artists remain in the final cleaned subset with lyrics.")

artists_with_lyrics.head(2)
5162 artists remain in the final cleaned subset with lyrics.
Out[ ]:
name lyrics spotify_id followers popularity genres chart_hits
0 $hirak [12 ContributorsMiljonair Lyrics“Miljonair” is... 1uUuCEXmWpKN2YLl5xSoT2 90233.0 57 [dutch hip hop, dutch rap pop] ['be (4)', 'nl (17)']
1 $uicideboy$ [184 ContributorsTranslationsTürkçeItalianoFra... 1VPmR4DJC1PlOtd0IADAO0 4561646.0 85 [dark trap, new orleans rap, underground hip hop] ['us (17)', 'au (1)', 'at (1)', 'bg (3)', 'ca ...

Recall the we are going to perform text analysis on the subgraphs for North America.

G_rap_NA and G_pop_NA

In [ ]:
print(f"The North American rap graph contains {len(G_rap_NA)} nodes.")
print(f"The North American pop graph contains {len(G_pop_NA)} nodes.")
The North American rap graph contains 349 nodes.
The North American pop graph contains 953 nodes.
In [ ]:
from utilities.text_analysis import clean_lyrics_and_apply_representative_text, filter_for_english_language, tokenize

Running Procedure on Rap subgraph for North America

In the following code block the lyrics extraction and tokenization as previously described is executed for the G_rap_NA

In [ ]:
rap_artists_df = artists_with_lyrics.copy()
rap_artists_df_prepared = rap_artists_df.copy()

amounts_of_artists_with_lyrics = 0
artist_names = set(rap_artists_df_prepared["name"])

for node in G_rap_NA.nodes():
    if node in artist_names:
        amounts_of_artists_with_lyrics += 1

print(f"{amounts_of_artists_with_lyrics} out of {len(G_rap_NA.nodes())} artists in the North American rap graph have lyrics available.")

rap_artists_df_prepared = clean_lyrics_and_apply_representative_text(rap_artists_df_prepared)
rap_artists_df_prepared  = tokenize(rap_artists_df_prepared)

rap_artists_df_prepared.head(2)
314 out of 349 artists in the North American rap graph have lyrics available.
Out[ ]:
name lyrics spotify_id followers popularity genres chart_hits representative_text tokens
0 $hirak [12 ContributorsMiljonair Lyrics“Miljonair” is... 1uUuCEXmWpKN2YLl5xSoT2 90233.0 57 [dutch hip hop, dutch rap pop] ['be (4)', 'nl (17)'] een miljonair mn pa zei hij weet dat het goed ... [een, miljonair, mn, pa, zei, hij, weet_dat, h...
1 $uicideboy$ [184 ContributorsTranslationsTürkçeItalianoFra... 1VPmR4DJC1PlOtd0IADAO0 4561646.0 85 [dark trap, new orleans rap, underground hip hop] ['us (17)', 'au (1)', 'at (1)', 'bg (3)', 'ca ... its not fair i found love it made me say that ... [fair, found, love_made, say, get_back, youll_...

Community-Based TF-IDF Analysis and WordCloud Visualization

This analysis explores linguistic patterns across artist communities using TF-IDF scores to extract key terms for each group and visualise them through word clouds. The process involves grouping, transforming, and analysing preprocessed lyrics data.

The following code block imports necessary functions from the utillities module. Most importantly:

prepare_community_texts, get_TF_dataframe, get_IDF_dataframe, get_TF_IDF_dataframe

Which all play part in preparing the use of plot_wordcloud

In [ ]:
from utilities.text_analysis import prepare_community_texts, get_top_n_largest_community_ids, get_TF_dataframe

from utilities.text_analysis import get_IDF_dataframe,get_TF_IDF_dataframe,print_top_5_TF_per_community,print_top_10_tf_idf_tokens

from utilities.text_analysis import  get_community_to_top_n_artists_by_followers, plot_wordcloud, run_tf_idf_wordcloud_analysis

1. Grouping Lyrics by Community¶

We begin by aggregating tokenized lyrics at the community level using the prepare_community_texts() function:

Next, we extract the top 5 largest communities, where "largest" is defined by the number of artists assigned to each community.

In [ ]:
community_counts = Counter(G_rap_NA_communities.values())

print(f"{len(community_counts)} unique communities were identified in the G_rap_NA network.")

rap_community_texts_df = prepare_community_texts(rap_artists_df_prepared, G_rap_NA_communities)

rap_top_5_community_ids = get_top_n_largest_community_ids(G_rap_NA_communities, 5)

rap_top_5_community_texts_df = rap_community_texts_df[rap_community_texts_df["community"].isin(rap_top_5_community_ids)]

rap_top_5_community_texts_df = rap_top_5_community_texts_df.sort_values(
    by="community",
    key=lambda col: col.map({cid: i for i, cid in enumerate(rap_top_5_community_ids)})
)

rap_top_5_community_texts_df["community_size"] = rap_top_5_community_texts_df["community"].map(community_counts)

print("Top 5 communities ranked by member count")

rap_top_5_community_texts_df
6 unique communities were identified in the G_rap_NA network.
Top 5 communities ranked by member count
Out[ ]:
community tokens tokens_length community_size
2 2.0 [singl, teas, octob, via, instagram, post, fea... 79466 97
1 1.0 [interview, fuse, chainz, share, origin, read_... 50023 57
0 0.0 [throughout, track, two, read, ohohoh, yeah_ye... 41029 54
5 5.0 [hrs, know, hitboy, hit, twenti, bitch, wouldn... 38372 53
4 4.0 [love, turn, headphon, littl_bit, yeah, mani, ... 33481 45

2. Calculating TF, IDF, and TF-IDF Scores¶

With the top 5 communities selected, we now calculate word importance within and across communities using Term Frequency (TF), Inverse Document Frequency (IDF), and TF-IDF scores.

In [ ]:
rap_tf_df = get_TF_dataframe(rap_top_5_community_texts_df)
rap_idf_df = get_IDF_dataframe(rap_community_texts_df)
rap_TF_IDF_df = get_TF_IDF_dataframe(rap_tf_df, rap_idf_df)

print("--- Top 5 terms in rap communities based on Term Frequency (TF) ---")
print_top_5_TF_per_community(rap_tf_df)

print("--- Top 5 terms in rap communities based on Term Frequency-Inverse Document Frequency (TF-IDF) ---")
print_top_10_tf_idf_tokens(rap_TF_IDF_df)
--- Top 5 terms in rap communities based on Term Frequency (TF) ---
Community: 0.0, 5 Terms: ['im', 'yeah', 'like', 'bitch', 'nigga']
Community: 1.0, 5 Terms: ['got', 'im', 'yeah', 'like', 'nigga']
Community: 2.0, 5 Terms: ['nigga', 'im', 'like', 'got', 'aint']
Community: 4.0, 5 Terms: ['got', 'like', 'yeah', 'nigga', 'im']
Community: 5.0, 5 Terms: ['im', 'nigga', 'yeah', 'bitch', 'like']
--- Top 5 terms in rap communities based on Term Frequency-Inverse Document Frequency (TF-IDF) ---
Community: 0.0, 10 Terms: ['krippi_krippi', 'think_ill', 'dejo', 'toot', 'thotiana', 'dos_tres', 'quieren', 'lightup', 'citi_bitch', 'si_lo']
Community: 1.0, 10 Terms: ['versac_versac', 'dm_go', 'muh', 'fi', 'caus_fine', 'whip_work', 'take_chanel', 'band_make', 'slumper', 'shutdown']
Community: 2.0, 10 Terms: ['ninetyon', 'knockin_door', 'db', 'ayyy', 'jump_jump', 'dingu', 'well_well', 'hey_ya', 'yap', 'dooo']
Community: 4.0, 10 Terms: ['bangbang', 'sheck_wes', 'hunnit_rack', 'spend_shit', 'chippi', 'psh', 'push_push', 'undivid', 'bustin_wrapper', 'durk_dej']
Community: 5.0, 10 Terms: ['peekaboo_peekaboo', 'black_yellow', 'ik', 'sway', 'sprey', 'mad_hoe', 'dexter', 'ringringringin', 'chyeah', 'op_je']

3. Identifying Top Artists per Community¶

To make the wordcloud plots more informative we identify the top artists by followers in each of the 5 selected communities.

In [ ]:
n=10

rap_communitiy_to_top_artists = get_community_to_top_n_artists_by_followers(rap_top_5_community_ids, G_rap_NA_communities, rap_artists_df, n)

for community_id, top_artists in rap_communitiy_to_top_artists.items():
    print(f"Top {n} most-followed artists in community {community_id}: {', '.join(top_artists)}")
Top 10 most-followed artists in community 2: Lil Wayne, Big Sean, DJ Khaled, Rick Ross, Kid Cudi, Logic, Russ, T.I., The Game, Wale
Top 10 most-followed artists in community 1: Migos, Lil Peep, 2 Chainz, Megan Thee Stallion, Rae Sremmurd, Meek Mill, Quavo, French Montana, Gucci Mane, Fetty Wap
Top 10 most-followed artists in community 0: Post Malone, Lil Baby, DaBaby, Polo G, Tyga, Lil Tjay, Blueface, Tory Lanez, Papa Roach, Rod Wave
Top 10 most-followed artists in community 5: Wiz Khalifa, YoungBoy Never Broke Again, Lil Yachty, Trippie Redd, Chance the Rapper, Lil Tecca, Kevin Gates, Ski Mask The Slump God, Lil Skies, Lil Mosey
Top 10 most-followed artists in community 4: Future, 21 Savage, Kodak Black, Roddy Ricch, YNW Melly, Young Thug, A Boogie Wit da Hoodie, 6ix9ine, Bryson Tiller, Gunna

4. Visualising Community-Specific Language with Word Clouds¶

In the final step, we generate a word cloud for each of the top 5 communities using their most distinctive terms determined by TF-IDF scores.

In [ ]:
rap_communitiy_to_top_3_artists = get_community_to_top_n_artists_by_followers(rap_top_5_community_ids, G_rap_NA_communities, rap_artists_df, 3)

plot_wordcloud(rap_TF_IDF_df, rap_top_5_community_texts_df, rap_communitiy_to_top_3_artists)
No description has been provided for this image

Running Procedure on Pop subgraph for North America¶

In the following code block the same procedure is executed for G_pop_NA, wrapped into a single run function to avoid redundance.

In [ ]:
run_tf_idf_wordcloud_analysis(artists_with_lyrics, G_pop_NA, G_pop_NA_communities)
748 out of 953 artists in the North American pop graph have lyrics available.

8 unique communities were identified in the G_pop_NA network.

Top 5 communities ranked by member count:

 community  tokens_length  community_size
       1.0          56123             144
       2.0          54594             129
       3.0          79796             176
       4.0          51862             132
       5.0          58916             180

--- Top 5 terms in pop communities based on Term Frequency (TF) ---

Community: 1.0, 5 Terms: ['yeah', 'im', 'like', 'got', 'your']
Community: 2.0, 5 Terms: ['your', 'get', 'love', 'im', 'like']
Community: 3.0, 5 Terms: ['get', 'love', 'im', 'got', 'like']
Community: 4.0, 5 Terms: ['like', 'oh', 'your', 'im', 'love']
Community: 5.0, 5 Terms: ['your', 'im', 'love', 'oh', 'like']

--- Top 10 terms in pop communities based on Term Frequency–Inverse Document Frequency (TF-IDF) ---

Community: 1.0, 10 Terms: ['dudu', 'bubbl_butt', 'life_save', 'spoon', 'hahahahaha', 'taki', 'pynk', 'fuckin_problem', 'jolen', 'dududu']
Community: 2.0, 10 Terms: ['prone', 'dddown', 'im_missin', 'yayae', 'go_sleep', 'oowo', 'durk_dej', 'oh_nana', 'shashasha', 'go_bang']
Community: 3.0, 10 Terms: ['wish_right', 'hol_hol', 'slay', 'wid', 'yuh', 'cout', 'blas', 'yallah', 'twerk_twerk', 'dup']
Community: 4.0, 10 Terms: ['dey', 'eheu', 'makeba', 'dum_dum', 'rockaby', 'eheheu', 'ye_ye', 'wo', 'shayo', 'oohe']
Community: 5.0, 10 Terms: ['feral', 'wakan', 'dupdupdup', 'dew_dew', 'depe', 'yide', 'aliv_feel', 'whippin', 'yoo', 'roo']

Top 10 most-followed artists in community 5: Adele, The Chainsmokers, Martin Garrix, Melanie Martinez, TWICE, P!nk, Liam Payne, Stray Kids, Hailee Steinfeld, Rita Ora
Top 10 most-followed artists in community 3: Ariana Grande, Rihanna, Bruno Mars, Maroon 5, Beyoncé, Nicki Minaj, David Guetta, Demi Lovato, ZAYN, Chris Brown
Top 10 most-followed artists in community 1: The Weeknd, Dua Lipa, Marshmello, Camila Cabello, Sia, Calvin Harris, Doja Cat, Sam Smith, Miley Cyrus, Halsey
Top 10 most-followed artists in community 4: Alok, Little Mix, Zara Larsson, James Arthur, Anne-Marie, Tiësto, Hozier, Bastille, Clean Bandit, Jonas Blue
Top 10 most-followed artists in community 2: Billie Eilish, Katy Perry, Cardi B, Khalid, The Neighbourhood, Lewis Capaldi, Lorde, Niall Horan, Troye Sivan, Kehlani

Generating word clouds...

No description has been provided for this image

Interpretting the TF scores, TF-IDF scores and WordClouds

Words rated high by TF scores:

  • Rap:

image.png

  • Pop:

image-2.png

A first glance over the words rated high in rap reveals heavy use of profanity and racial slurs, meanwhile within pop emotional and personal terms such as 'im', 'your' and 'love' are found.

The generel difference in linguistic styles seems to be no surprise and correspond with established expectations of rap and pop.

Words rated high by TF-IDF scores

  • Rap:

image-2.png

  • Pop:

image.png

Now glancing over the words ranked high by TF-IDF scores a shift occurs, specifically terms across communities within each of rap and pop now differ in style.

The difference of terms within communities is caused by applying IDF to the equation.

It becomes possible to discern communities within each of rap and pop

WordClouds

Analysing communities based on word clouds and artist groupings proved to be a challenging task. Most of the word clouds and artist groupings across communities are inconclusive when it comes to assigning clear stylistic or thematic labels. In many cases, the most frequent words are dominated by sound effects (e.g., dudu, taki, ye_ye, oohe) rather than meaningful lyrical content.

Additionally, the artist communities themselves are often mixed, including singers from different subgenres.

However, the following 2 communities stands out:

RAP

Community 4 Artists: Future, 21 Savage, Kodak Black, Roddy Ricch, YNW Melly, Young Thug, Gunna, A Boogie Wit da Hoodie, 6ix9ine, Bryson Tiller

Frequent Words:: rack, hunnit, bustin, spend, banger, wrapper, scheck, push, shit, okay

Community Title:: Trap/Street-Oriented Rap.

  • This community shows strong use of street slang.
  • Artists are associated with trap, mumble flows, and dark, melodic beats.
  • Words like bustin, rack, hunnit align perfectly with trap themes.
  • Most of the artists reached to prominence from 2015 and onwards - a newer generation of rap.

POP

Community 3 Artists: Ariana Grande, Rihanna, Bruno Mars, Beyoncé, Nicki Minaj, David Guetta, Demi Lovato, Chris Brown

Frequent Words:: Not very telling...

Community Title:: Pop + R&B

  • This group consists of artists known for blending pop with R&B vocal style.

Results

The analysis of the POP and RAP collaboration networks revealed distinct commmunity patterns across the two genres. Both networks exhibit heavy-tailed degree distribution, consistent with the power-law behaviour of typical real-world social networks. However, the rap network shows a more skewed distribution, with a lower power-law exponent, suggesting that collaboration are more concentrated around a few central artists.

Structually, the pop network is more fragmented, with a large number of isolated nodes, and a lower average degree. On the other hand, the rap network is smaller but more cohesive, with higher average degree and one dominant giant component covering a 80% of the nodes, indicating that RAP artists tend to collaborate more broadly within a tighter community structure.

These differences are further reflected in the roles of the top collaborators. In the pop network the most connected artists are DJ's and producers whose collaboration networks are shaped by featuring many vocalists. In RAP, the most connected figures are vocal performers themselves, aligning with the genre's emphasis on featured verses and mutual artistic involvement.

Zooming in on the North American communitites lead us to exciting network science perspectives. Both RAP_NA and POP_NA exhibit statistically significant community structes, as indicated by modularity scores exceding those of their relative 1000 null models. Although POP_NA showed a slightly higher modularity, we avoid direct comparison due to the differences in network size and density. Additionally both networks showed reduced degree dissortativity relative to the null models, though assortativity remains low overall, which is likely a result of well connected artists collaborating across the popularity spectrum.

Furthermore, the correlation between degree and popularity-attribute is notably stronger in the RAP_NA network, suggesting that success in the rap genre is more closely linked to collaboration activity. In the POP_NA network, the weaker correlation may reflect that popularity in this genre is driven more by external factors such as label promotion or individual success rather than a good network position

The text analysis reveals heavy use of profanity and racial slurs in rap lyrics, whereas pop lyrics are more emotionally driven, with frequent use of personal terms such as 'im'", 'your', and 'love'. Attempts to analyse communities using word clouds were largely unsuccessful due to the presence of sound effects, and gibberish tokens dominating the most frequent words.

Nonetheless, a clear Trap / Street-Oriented Rap community was identified, featuring artists such as Future, 21 Savage, and Kodak Black, and characterized by words like bustin, rack, and hunnit, which align with modern trap themes such as mumble flows and dark, melodic beats.

Additionally, a Pop + R&B community was identified, featuring artists like Ariana Grande, Rihanna, Bruno Mars, and Beyonce, who are known for blending pop production with R&B vocal stylings.

Discussion

Even thought the rap network overall was more cohesive than the pop network, whould we take into account that the pop network was about four times as large, making the comparison between the networks slighly biased.

Comparing the RAP_NA and POP_NA networks to their respective null models, and analyzing the correlation between popularity and degree, revealed relevant and insightful results. These findings provide a deeper understanding of how musical collaboration differs between genres.

That said, a valuable direction for future work would be to redesign the network structure to allow for more meaningful comparisons between genres. In hindsight, creating a single unified network—with genre represented as a node attribute—could have enabled new analytical opportunities and deeper cross-genre insights. However, such a design might also have limited some of the genre-specific analyses presented in our current results.

The text analysis succeded in displaying established expectations, i.e. that rap has heavy use of profanity and racial slurs, whilst pop concerns itself more with emotional themes such as love. Additionally some discertations was possible leading to discerning two communities singing modern Trap / Street-Oriented Rap and Pop + R&B.

To improve the text analysis a couple of things might have been usefull.

One idea could be that the gibberish in the word clouds occurs due to IDF values having too much influence. To address this, the following correction was attempted:

$$TFIDF_{adjusted}(t,d,D) = tf(t,d)\cdot idf(t,D)^{p}$$

however, changing the power scale did not make a significant difference in reaching better conclusions. Therefore a power of 1 is used in the code.

To improve the word clouds, one key could be adjusting the TF-IDF weighting such that TF has more influence. However, this was not successfully achieved in this project. When reviewing words that scored highly based on TF alone, it appeared that fewer of them were gibberish.

Additionally, a better cleaning process for removing sound effects and gibberish might have made a significant difference in word cloud quality. Removing such tokens would almost certainly help.